# Run the following code to print multiple outputs from a cell
get_ipython().ast_node_interactivity = 'all'
Coding Project: Final Draft¶
# Enter your code here
import pandas as pd
import numpy as np
df = pd.read_csv("census.csv")
df
| | state | county | fips | population | median_age | pct_over_65 | median_income | per_capita_income | pct_white | pct_black | pct_asian | pct_hispanic | pct_college | unemployment | pop_density | is_metro | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Autauga County | 1001 | 55380 | 38.2 | 15.0 | 58731 | 29819 | 76.8 | 19.0 | 1.0 | 2.8 | 26.6 | 3.5 | 91.8 | 1.0 | Southeast |
| 1 | Alabama | Baldwin County | 1003 | 212830 | 43.0 | 20.0 | 58320 | 32626 | 86.2 | 9.3 | 0.9 | 4.6 | 31.9 | 4.0 | 114.6 | 1.0 | Southeast |
| 2 | Alabama | Barbour County | 1005 | 25361 | 40.4 | 18.6 | 32525 | 18473 | 46.8 | 47.6 | 0.5 | 4.4 | 11.6 | 9.4 | 31.0 | 0.0 | Southeast |
| 3 | Alabama | Bibb County | 1007 | 22493 | 40.9 | 15.9 | 47542 | 20778 | 76.8 | 22.3 | 0.1 | 2.6 | 10.4 | 7.0 | 36.8 | 1.0 | Southeast |
| 4 | Alabama | Blount County | 1009 | 57681 | 40.7 | 17.9 | 49358 | 24747 | 95.5 | 1.6 | 0.4 | 9.3 | 13.1 | 3.1 | 88.9 | 1.0 | Southeast |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3137 | Wyoming | Sweetwater County | 56037 | 43521 | 35.3 | 11.4 | 74843 | 32603 | 93.4 | 1.2 | 0.8 | 15.9 | 22.5 | 5.7 | 4.2 | 0.0 | West |
| 3138 | Wyoming | Teton County | 56039 | 23280 | 39.3 | 14.0 | 84678 | 54051 | 89.3 | 1.2 | 1.3 | 15.0 | 57.0 | 0.7 | 5.3 | 0.0 | West |
| 3139 | Wyoming | Uinta County | 56041 | 20479 | 35.8 | 13.0 | 63403 | 28159 | 93.4 | 0.1 | 0.2 | 9.1 | 16.0 | 5.5 | 10.1 | 0.0 | West |
| 3140 | Wyoming | Washakie County | 56043 | 8027 | 42.9 | 21.1 | 54158 | 28101 | 89.7 | 0.0 | 0.0 | 14.2 | 23.4 | 4.1 | 3.8 | 0.0 | West |
| 3141 | Wyoming | Weston County | 56045 | 7049 | 43.1 | 19.4 | 57031 | 28531 | 97.4 | 0.2 | 0.8 | 1.1 | 20.0 | 4.0 | 3.0 | 0.0 | West |
3142 rows × 17 columns
2. Run the command to get a scatterplot matrix (pandas.plotting.scatter_matrix()). Note: if you have a large number of columns in your dataset, you should select the 5-10 relevant columns to plot.¶
# Enter your code here
# region is categorical, so scatter_matrix would silently drop it; plot the six numeric columns
pd.plotting.scatter_matrix(df[['median_income', 'median_age', 'pct_college', 'population', 'unemployment', 'is_metro']])
(Output: a 6 × 6 grid of Axes objects — the scatterplot matrix for median_income, median_age, pct_college, population, unemployment, and is_metro.)
Feature Engineering¶
3. Do some feature engineering with your data...I'd like you to do at least 2 of the following:¶
- Fix the data type if any of your variables were read in incorrectly (e.g., a categorical variable is being read in as an integer/float or a date being read in as an object)¶
- Transform variables as needed (e.g., create a squared term if there is curvature in a scatterplot, log a variable that is positive and skewed)¶
- Create an interaction term if you believe it makes sense to combine 2 variables¶
- Create a categorical variable from a numeric variable (e.g., create True/False categories or use numpy.select() to create bins)¶
Add as many code cells as needed and write a few sentences below to explain why you're making these changes¶
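One of the options above, `numpy.select()` binning, can be sketched as follows. This is illustration only: the mini-frame and the income cutoffs are invented, not taken from the census data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame (column name mirrors the census data; values invented)
demo = pd.DataFrame({"median_income": [32000, 58000, 95000, 140000]})

# numpy.select takes parallel lists of boolean conditions and labels;
# the first matching condition wins, and `default` catches everything else
conditions = [
    demo["median_income"] < 40000,
    demo["median_income"] < 75000,
    demo["median_income"] < 120000,
]
labels = ["low", "middle", "upper-middle"]
demo["income_bin"] = np.select(conditions, labels, default="high")
print(demo["income_bin"].tolist())  # → ['low', 'middle', 'upper-middle', 'high']
```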
I will create a categorical variable to identify affluent towns. This will be defined as towns where the median income is in the top quartile.
I will create a column that shows the percentage of the population that is non-white. NGF research says this segment of the population is currently driving the growth of golf, so it could be helpful to keep track of this percentage directly.
Lastly, I will create a column that takes a combination of income, education, diversity, and population and assigns a weight to each. This composite column will represent the opportunity score for golf marketing efforts.
# Top-quartile cutoff; df["median_income"].quantile(0.75) evaluates to ~59867.25
df["affluent"] = df["median_income"] > df["median_income"].quantile(0.75)
df.head()
| | state | county | fips | population | median_age | pct_over_65 | median_income | per_capita_income | pct_white | pct_black | pct_asian | pct_hispanic | pct_college | unemployment | pop_density | is_metro | region | affluent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Autauga County | 1001 | 55380 | 38.2 | 15.0 | 58731 | 29819 | 76.8 | 19.0 | 1.0 | 2.8 | 26.6 | 3.5 | 91.8 | 1.0 | Southeast | False |
| 1 | Alabama | Baldwin County | 1003 | 212830 | 43.0 | 20.0 | 58320 | 32626 | 86.2 | 9.3 | 0.9 | 4.6 | 31.9 | 4.0 | 114.6 | 1.0 | Southeast | False |
| 2 | Alabama | Barbour County | 1005 | 25361 | 40.4 | 18.6 | 32525 | 18473 | 46.8 | 47.6 | 0.5 | 4.4 | 11.6 | 9.4 | 31.0 | 0.0 | Southeast | False |
| 3 | Alabama | Bibb County | 1007 | 22493 | 40.9 | 15.9 | 47542 | 20778 | 76.8 | 22.3 | 0.1 | 2.6 | 10.4 | 7.0 | 36.8 | 1.0 | Southeast | False |
| 4 | Alabama | Blount County | 1009 | 57681 | 40.7 | 17.9 | 49358 | 24747 | 95.5 | 1.6 | 0.4 | 9.3 | 13.1 | 3.1 | 88.9 | 1.0 | Southeast | False |
df["diversity"] = 100 - df["pct_white"]
# Normalize each component to 0-1
income_score = (df['median_income'] / df['median_income'].max()).clip(0, 1)
education_score = (df['pct_college'] / df['pct_college'].max()).clip(0, 1)
diversity_score = (df['diversity'] / df['diversity'].max()).clip(0, 1)
size_score = (np.log1p(df['population']) / np.log1p(df['population'].max())).clip(0, 1)
# Younger counties score higher (capped: median ages of 30 and below all get the full boost)
age_score = (50 - df['median_age']).clip(0, 20) / 20
# Weighted sum (weights total 100, so the score lands on a 0-100 scale)
df['opportunity_score'] = (
    income_score * 35 +
    education_score * 25 +
    diversity_score * 15 +
    size_score * 15 +
    age_score * 10
).round(1)
df.head()
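Because the weights total 100 and every normalized component lies in [0, 1], the composite is guaranteed to land on a 0-100 scale. A quick sanity check of that property on a tiny invented frame (column names mirror the census data, values are made up):

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame for illustration only
demo = pd.DataFrame({
    "median_income": [40000, 80000, 120000],
    "pct_college":   [15.0, 30.0, 60.0],
    "diversity":     [10.0, 25.0, 50.0],
    "population":    [8000, 90000, 900000],
    "median_age":    [45.0, 38.0, 33.0],
})

# Same recipe as the opportunity score: scale each component to 0-1, then weight
income_score    = (demo["median_income"] / demo["median_income"].max()).clip(0, 1)
education_score = (demo["pct_college"] / demo["pct_college"].max()).clip(0, 1)
diversity_score = (demo["diversity"] / demo["diversity"].max()).clip(0, 1)
size_score      = (np.log1p(demo["population"]) / np.log1p(demo["population"].max())).clip(0, 1)
age_score       = (50 - demo["median_age"]).clip(0, 20) / 20

score = (income_score * 35 + education_score * 25 + diversity_score * 15
         + size_score * 15 + age_score * 10).round(1)
print(score.tolist())  # every value falls between 0 and 100
```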
Inferential Modeling¶
4. Build a regression model (either OLS, Logistic, or Poisson) explaining the relationships between 1 dependent variable of interest and 2 or more independent variables...be sure to build the appropriate model type based on your dependent variable¶
import statsmodels.formula.api as smf
results = smf.ols("median_income ~ median_age + pct_college + population + unemployment + diversity", data = df).fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: median_income R-squared: 0.556
Model: OLS Adj. R-squared: 0.555
Method: Least Squares F-statistic: 785.8
Date: Sun, 22 Feb 2026 Prob (F-statistic): 0.00
Time: 17:11:31 Log-Likelihood: -33221.
No. Observations: 3142 AIC: 6.645e+04
Df Residuals: 3136 BIC: 6.649e+04
Df Model: 5
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 4.686e+04 1657.753 28.268 0.000 4.36e+04 5.01e+04
median_age -170.9951 33.829 -5.055 0.000 -237.323 -104.667
pct_college 905.3288 20.031 45.197 0.000 866.054 944.603
population 0.0030 0.001 5.505 0.000 0.002 0.004
unemployment -1134.7679 78.488 -14.458 0.000 -1288.661 -980.875
diversity -52.6311 12.394 -4.247 0.000 -76.932 -28.330
==============================================================================
Omnibus: 310.165 Durbin-Watson: 1.569
Prob(Omnibus): 0.000 Jarque-Bera (JB): 873.272
Skew: 0.535 Prob(JB): 2.35e-190
Kurtosis: 5.351 Cond. No. 3.41e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.41e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
5. Based on your model output, interpret the coefficients for at least 2 statistically significant variables, making sure to quantify the relationships between those variables and your dependent variable. If you don't have 2 statistically significant variables in your model, provide an explanation of what you expected would be significant and any theories as to why it might not be significant in your model¶
pct_college = 905.3288: for each 1 percentage point increase in the college-educated share of the population, median income increases by about $905, holding the other variables constant. unemployment = -1134.7679: for each 1 percentage point increase in the unemployment rate, median income decreases by about $1,135.
6. Write a few sentences about your conclusions from the model. What was interesting or surprising about your results? Does your model confirm any hypotheses you might have had about your data? Does it change any of your hypotheses?¶
The relationship between education and income makes sense, because people with college degrees tend to have higher-paying jobs than those without. And the relationship between unemployment and median_income seems obvious too. I had expected the age variable to have a positive, significant relationship with income, but the model shows a negative, significant relationship. I believe my mistake comes from not taking into account the older, retired segment of the population with reduced income. In reality, the median income probably has a maximum when the median age is somewhere under retirement age, and the population is closer to their maximum earning potential.
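That inverted-U hypothesis can be tested directly by adding a squared age term to the regression. The sketch below uses synthetic data (the census file isn't bundled here) with a peak deliberately built in at age 45, to show how the fitted coefficients recover it; patsy's inline `I(median_age ** 2)` would work in the formula as well.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data with a built-in inverted U: income peaks at age 45
rng = np.random.default_rng(42)
demo = pd.DataFrame({"median_age": rng.uniform(25, 65, 500)})
demo["median_income"] = (
    80000 - 60 * (demo["median_age"] - 45) ** 2 + rng.normal(0, 2000, 500)
)

# Add the squared term as its own column, then fit linear + quadratic
demo["age_sq"] = demo["median_age"] ** 2
fit = smf.ols("median_income ~ median_age + age_sq", data=demo).fit()

# A negative coefficient on age_sq means an inverted U; the fitted parabola
# peaks at -b_linear / (2 * b_quadratic)
b1, b2 = fit.params["median_age"], fit.params["age_sq"]
print(round(-b1 / (2 * b2), 1))  # recovers a peak near 45, by construction
```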
# Define thresholds
high_affluence = df['affluent'] == True
high_diversity = df['diversity'] >= df['diversity'].median() # or pick a threshold like 30
# Count counties that are both
both = (high_affluence & high_diversity).sum()
pct_both = (both / len(df) * 100).round(1)
print(f"Counties that are both affluent AND high diversity: {both} ({pct_both}%)")
Counties that are both affluent AND high diversity: 426 (13.6%)
outcome = df['affluent']
8. Pick at least 5 other variables to include as features in your model. Create a features variable, making sure to create dummy variables for any categorical data you're including...add as many code cells as needed below¶
numericFeatures = df[['pct_college', 'median_age', 'diversity', 'population', 'unemployment', 'is_metro']]
dummies = pd.get_dummies(df[['region']], drop_first=True)
features = pd.concat([numericFeatures, dummies], axis=1)
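With `drop_first=True`, `pd.get_dummies` turns the four-level region column into three 0/1 indicators, dropping the alphabetically first level as the implicit baseline (which avoids perfect collinearity with a model intercept). A quick sketch on invented labels:

```python
import pandas as pd

# Invented region labels for illustration
demo = pd.DataFrame({"region": ["Southeast", "West", "Northeast", "West"]})

# Levels sort as Northeast, Southeast, West; drop_first=True drops "Northeast",
# which becomes the baseline category
dummies = pd.get_dummies(demo[["region"]], drop_first=True)
print(list(dummies.columns))  # → ['region_Southeast', 'region_West']
```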
9. Partition your data into training and test data¶
from sklearn.model_selection import train_test_split
featuresTrain, featuresTest, outcomeTrain, outcomeTest = train_test_split(features,
outcome,
test_size = 0.33,
random_state = 42)
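Since only about a quarter of counties are affluent, passing `stratify=outcome` to `train_test_split` would guarantee both partitions keep that class balance. A sketch on synthetic labels with the same rough 25/75 imbalance (the data here is invented):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an outcome with a 25% positive rate
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([True] * 250 + [False] * 750)

# stratify=y preserves the 25% positive rate in both partitions
XTrain, XTest, yTrain, yTest = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

print(round(yTrain.mean(), 2), round(yTest.mean(), 2))  # → 0.25 0.25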
10. Build 1 predictive machine learning model (Decision Tree, Random Forest, or Support Vector Machine) predicting your outcome variable from the features you've selected. If your outcome variable is categorical, you should use one of the Classifier models covered in the Week 8 hands-on videos. If your outcome variable is numeric, you should use a Regressor model...refer to the ML_RegressorModels.ipynb file in Jupyter Hub for that code¶
import sklearn.ensemble
import sklearn.metrics
modelForest = sklearn.ensemble.RandomForestClassifier(random_state = 42)
# 2. Fit the model using the training data
resultForest = modelForest.fit(featuresTrain, outcomeTrain)
# 3. Predict outcomes from the training and testing data
predForestTrain = modelForest.predict(featuresTrain)
predForestTest = modelForest.predict(featuresTest)
# 4. Assess the fit
print(sklearn.metrics.classification_report(outcomeTrain, predForestTrain))
cmForestTest = sklearn.metrics.confusion_matrix(outcomeTest, predForestTest)
sklearn.metrics.ConfusionMatrixDisplay(cmForestTest, display_labels = modelForest.classes_).plot()
print(sklearn.metrics.classification_report(outcomeTest, predForestTest))
precision recall f1-score support
False 1.00 1.00 1.00 1574
True 1.00 1.00 1.00 531
accuracy 1.00 2105
macro avg 1.00 1.00 1.00 2105
weighted avg 1.00 1.00 1.00 2105
precision recall f1-score support
False 0.88 0.94 0.91 782
True 0.76 0.61 0.68 255
accuracy 0.86 1037
macro avg 0.82 0.77 0.79 1037
weighted avg 0.85 0.86 0.85 1037
!pip install plotly
from plotnine import *
breaks = np.arange(0, 160000, 20000)
labels = ['$' + str(int(x/1000)) + 'k' for x in breaks]
a = (ggplot(df, aes(x='median_income', y='opportunity_score', size='population', color='population'))
+ geom_point(alpha=0.3)
+ labs(title='Opportunity Increases with Median Income',
x='Median Household Income',
y='Opportunity Score',
size = 'Population')
+ theme_minimal()
+ scale_x_continuous(breaks = breaks, labels = labels)
)
b = (ggplot(df, aes(x='region', y='opportunity_score', fill='region'))
+ geom_boxplot()
+ labs(title='Northeast and West Lead in Opportunity Score',
x='Region',
y='Opportunity Score')
+ theme_minimal()
)
c = (ggplot(df, aes(x='median_age', y='opportunity_score', color='affluent', size='population'))
+ geom_point(alpha=0.5)
+ labs(title='High Opportunity Scores Peak in Affluent Counties',
x='Median Age',
y='Opportunity Score',
color='')
+ scale_color_manual(labels = ["Non-affluent", "Affluent"],
values = ["lightgreen", "darkgreen"])
+ theme_minimal()
)
# Get top 100 counties
top100 = df.nlargest(100, 'opportunity_score')
top1000 = df.nlargest(1000, 'opportunity_score')
d = (ggplot(top100, aes(x='region', fill='region'))
+ geom_bar()
+ labs(title='Southeast Has The Most Counties in the Top 100',
x='',
y='Counties in Top 100',
fill = '')
+ theme_minimal()
+ theme(legend_position='none')
)
# Count by state and get top 15 states
state_counts = top1000['state'].value_counts().head(15).reset_index()
state_counts.columns = ['state', 'count']
e = (ggplot(state_counts, aes(x='reorder(state, count)', y='count', fill='state'))
+ geom_bar(stat='identity')
+ coord_flip()
     + labs(title='Texas and Virginia Lead in High Opportunity Counties',
            x='',
            y='Counties in Top 1000')
+ theme_minimal()
+ theme(legend_position='none')
)
#ggsave(a, "OppInc.png")
#ggsave(b, "RegOpp.png")
#ggsave(c, "OppAff.png")
#ggsave(d, "top100.png")
#ggsave(e, "top1000.png")
a
b
c
d
e
df['high_opportunity'] = df['opportunity_score'] >= 50
f = (ggplot(df, aes(x='region', fill='high_opportunity'))
+ geom_bar(position='fill')
+ labs(title='Concentration of High Opportunity Counties',
x='',
y='Proportion',
fill='High Opportunity')
+ scale_fill_manual(values=['lightgreen', 'darkgreen'],
labels=['No', 'Yes'])
+ theme_minimal()
)
ggsave(f, "topprop.png")
f
# Create a flag for counties that are both affluent AND high diversity
df['both'] = (df['affluent'] == True) & (df['diversity'] >= df['diversity'].median())
g = (ggplot(df, aes(x='median_income', y='diversity', color='both'))
+ geom_point(alpha=0.5)
+ labs(title='The Intersection of Affluence and Diversity',
x='Median Income',
y='Diversity Index',
color='High on Both')
+ scale_color_manual(values=['gray', 'blue'],
labels=['No', 'Yes (13.6%)'])
+ theme_minimal()
)
ggsave(g, "AffDiv.png")
g
df.nlargest(10, 'opportunity_score')[['state', 'county', 'opportunity_score', 'median_income', 'pct_college', 'diversity', 'population']]
| | state | county | opportunity_score | median_income | pct_college | diversity | population |
|---|---|---|---|---|---|---|---|
| 2872 | Virginia | Loudoun County | 79.1 | 142299 | 61.3 | 35.0 | 395134 |
| 2826 | Virginia | Arlington County | 77.4 | 120071 | 75.3 | 28.5 | 233464 |
| 228 | California | Santa Clara County | 75.9 | 124055 | 52.4 | 55.5 | 1927470 |
| 2848 | Virginia | Fairfax County | 75.6 | 124831 | 61.6 | 38.8 | 1145862 |
| 2925 | Virginia | Falls Church city | 74.1 | 127610 | 77.6 | 20.6 | 14128 |
| 1205 | Maryland | Howard County | 74.0 | 121160 | 62.6 | 43.0 | 318855 |
| 223 | California | San Francisco County | 73.3 | 112449 | 58.1 | 53.6 | 874961 |
| 226 | California | San Mateo County | 72.0 | 122641 | 51.0 | 49.4 | 767423 |
| 1207 | Maryland | Montgomery County | 71.3 | 108820 | 58.9 | 46.9 | 1043530 |
| 319 | District of Columbia | District of Columbia | 69.7 | 86420 | 58.5 | 58.7 | 692683 |
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'notebook'
state_avg = df.groupby('state')['opportunity_score'].mean().reset_index()
state_abbrev = {
'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA',
'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE', 'Florida': 'FL', 'Georgia': 'GA',
'Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA',
'Kansas': 'KS', 'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD',
'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS', 'Missouri': 'MO',
'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV', 'New Hampshire': 'NH', 'New Jersey': 'NJ',
'New Mexico': 'NM', 'New York': 'NY', 'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH',
'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC',
'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT',
'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY',
'District of Columbia': 'DC'
}
state_avg['state_abbrev'] = state_avg['state'].map(state_abbrev)
fig = px.choropleth(state_avg,
locations='state_abbrev',
locationmode='USA-states',
color='opportunity_score',
scope='usa',
title='Coast to Coast Opportunities',
color_continuous_scale='RdYlGn',
range_color=[10, state_avg['opportunity_score'].max()],
labels={'opportunity_score': 'Opportunity Score'})
fig.update_layout(
width=1000,
height=600)
fig.show()
11. Assess the strength of your model on the training vs. test data. For classifier models, you should assess the models using the accuracy %. For regressor models, you should use root mean squared error (RMSE) as shown in the ML_RegressorModels.ipynb file referenced above.¶
Write a few sentences quantifying your model's strength...in your explanation, be sure to address whether or not you see evidence of overfitting¶
The random forest model has 86% accuracy on the test data. Since this is 14 percentage points below its perfect 100% accuracy on the training data, the model is clearly overfitting. Looking at the confusion matrix, it does a better job of predicting non-affluent counties (94% recall) than affluent ones (61% recall).
Next Steps¶
12. Based on your model results, what do you think your next steps would be? Does the model seem strong enough to implement as is? Does it make sense to try other models? Do you think other data might strengthen your models if you had access to it?¶
Note: you do not need to create additional models for this assignment. But, prior to submitting your final memo, you should try out other models to see if you can build a stronger model.¶
Despite the overfitting discussed above, the 86% test accuracy means the model does a relatively good job of predicting which counties have median incomes in the top quartile based on other demographic features. It could also make sense to try a regressor model, using median_income itself as the outcome variable. In addition, more detailed demographic data (post-graduate education, median home value, school district ratings) would almost certainly strengthen the model.
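Two concrete next steps could be sketched together: capping `max_depth` is one lever against the train/test gap noted above, and `feature_importances_` shows which inputs the forest actually leans on. The data below is synthetic (one informative feature, two noise features), so this is an illustration of the mechanics rather than a result on the census data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: only the first column actually drives the outcome
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 3))
y = X[:, 0] > 0

# A shallow forest generalizes better when the signal is simple;
# feature_importances_ then ranks the inputs by how much they are used
model = RandomForestClassifier(max_depth=4, random_state=42).fit(X, y)
print(model.feature_importances_.argmax())  # → 0 (the informative column)
```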